Least Squares Temporal Difference Learning and Galerkin’s Method
Abstract
The problem of estimating the value function underlying a Markovian reward process is considered. As is well known, the value function underlying a Markovian reward process satisfies a linear fixed-point equation. One approach to learning the value function from finite data is to find a good approximation to it in a given (linear) subspace of the space of value functions. We review some of the issues that arise when following this approach, as well as some results that characterize the finite-sample performance of some of the algorithms.

1 Markovian Reward Processes

Let X be a measurable space and consider a stochastic process (X_0, R_1, X_1, R_2, X_2, ...), where X_t ∈ X and R_{t+1} ∈ R, t = 0, 1, 2, .... The process is called a Markovian reward process if
• (X_0, X_1, ...) is a Markov process, and
• for any t ≥ 0, given X_t and X_{t+1}, the distribution of R_{t+1} is independent of the history of the process.
Here, X_t is called the state of the system at time t, while R_{t+1} is the reward associated with the transition from X_t to X_{t+1}. We shall denote by P the Markov kernel underlying the process: thus, the distribution of (X_{t+1}, R_{t+1}) given X_t is P(·, ·|X_t), t = 0, 1, .... Fix the so-called discount factor 0 ≤ γ ≤ 1 and define the (total discounted) return associated with the process as

R = \sum_{t=0}^{\infty} \gamma^t R_{t+1}.
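Since the paper's subject is least-squares temporal difference (LSTD) learning of such value functions, a minimal sketch may help fix ideas. The code below is an illustration under standard assumptions rather than the paper's own algorithm or analysis: it builds the usual LSTD linear system A θ = b from a sampled trajectory with features φ(X_t) and, on a toy three-state chain with invented transition probabilities and rewards, compares the estimate against the exact solution of the linear fixed-point equation V = r + γPV. The function name `lstd`, the ridge term, and all numbers are choices made for this sketch.

```python
import numpy as np

def lstd(features, rewards, next_features, gamma, ridge=1e-6):
    """Least-squares TD estimate of theta such that V(x) ~ phi(x)^T theta.

    features:      (T, d) array whose rows are phi(X_t)
    next_features: (T, d) array whose rows are phi(X_{t+1})
    rewards:       (T,)   array of rewards R_{t+1}
    gamma:         discount factor
    ridge:         small regularizer keeping the linear system well posed
    """
    d = features.shape[1]
    # A = sum_t phi(X_t) (phi(X_t) - gamma * phi(X_{t+1}))^T  (+ ridge * I)
    A = features.T @ (features - gamma * next_features) + ridge * np.eye(d)
    # b = sum_t phi(X_t) R_{t+1}
    b = features.T @ rewards
    return np.linalg.solve(A, b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 3-state chain MRP; all numbers are made up for the illustration.
    P = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.9, 0.1],
                  [0.1, 0.0, 0.9]])
    r = np.array([0.0, 0.0, 1.0])   # expected reward when leaving each state
    gamma = 0.95

    # Sample a trajectory and use one-hot (tabular) features.
    T, x = 20_000, 0
    Phi, Phi_next, R = np.zeros((T, 3)), np.zeros((T, 3)), np.zeros(T)
    for t in range(T):
        x_next = rng.choice(3, p=P[x])
        Phi[t, x], Phi_next[t, x_next], R[t] = 1.0, 1.0, r[x]
        x = x_next

    theta = lstd(Phi, R, Phi_next, gamma)
    v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)   # V = r + gamma P V
    print("LSTD estimate :", np.round(theta, 3))
    print("exact solution:", np.round(v_exact, 3))
```

With tabular features the LSTD estimate approaches the exact value function as the trajectory grows; with a genuine (lower-dimensional) feature subspace it instead converges to a projected fixed point, which is the approximation setting the abstract refers to.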
Similar Resources
A Non-Parametric Approach to Dynamic Programming
In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin's method. Furthermore, we present a unified view of several well-known policy...
Sustainable ℓ2-regularized actor-critic based on recursive least-squares temporal difference learning
Least-squares temporal difference learning (LSTD) has been used mainly for improving the data efficiency of the critic in actor-critic (AC) methods. However, convergence analysis of the resulting algorithms is difficult when the policy is changing. In this paper, a new AC method is proposed based on LSTD under the discounted criterion. The method's contribution comprises two components: (1) LSTD works in an ...
Ensembles of extreme learning machine networks for value prediction
Value prediction is an important subproblem of several reinforcement learning (RL) algorithms. In a previous work, it has been shown that the combination of least-squares temporal-difference learning with ELM (extreme learning machine) networks is a powerful method for value prediction in continuous-state problems. This work proposes the use of ensembles to improve the approximation capabilitie...
Kernel Least-Squares Temporal Difference Learning
Kernel methods have recently attracted much research interest, since by utilizing Mercer kernels, non-linear and non-parametric versions of conventional supervised or unsupervised learning algorithms can be implemented, usually with better generalization. However, kernel methods in reinforcement learning have not been widely studied in the literature. In this paper, w...
Locally Weighted Least Squares Temporal Difference Learning
This paper introduces locally weighted temporal difference learning for the evaluation of a class of policies whose value function is nonlinear in the state. Least-squares temporal difference learning is used to train local models according to a distance metric in state space. Empirical evaluations are reported demonstrating learning performance on a number of strongly non-linear value function...
Publication year: 2011